========================================================

##   user_id product_id gender   age occupation city_category
## 1 1000001  P00069042      F  0-17         10             A
## 2 1000001  P00248942      F  0-17         10             A
## 3 1000001  P00087842      F  0-17         10             A
## 4 1000001  P00085442      F  0-17         10             A
## 5 1000002  P00285442      M   55+         16             C
## 6 1000003  P00193542      M 26-35         15             A
##   stay_in_current_city_years marital_status product_category_1
## 1                          2              0                  3
## 2                          2              0                  1
## 3                          2              0                 12
## 4                          2              0                 12
## 5                         4+              0                  8
## 6                          3              0                  1
##   product_category_2 product_category_3 purchase
## 1                 NA                 NA     8370
## 2                  6                 14    15200
## 3                 NA                 NA     1422
## 4                 14                 NA     1057
## 5                 NA                 NA     7969
## 6                  2                 NA    15227

The dataset contains 537,577 observations of transactions made on Black Friday in a retail store.

Univariate Plots Section

## [1] 537577     12

The dataset contains 537,577 observations and 12 variables.

Structure of the dataset:

## 'data.frame':    537577 obs. of  12 variables:
##  $ user_id                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ product_id                : Factor w/ 3623 levels "P00000142","P00000242",..: 671 2375 851 827 2733 1830 1744 3319 3597 2630 ...
##  $ gender                    : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
##  $ age                       : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
##  $ occupation                : int  10 10 10 10 16 15 7 7 7 20 ...
##  $ city_category             : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
##  $ stay_in_current_city_years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
##  $ marital_status            : int  0 0 0 0 0 0 1 1 1 1 ...
##  $ product_category_1        : int  3 1 12 12 8 1 1 1 1 8 ...
##  $ product_category_2        : int  NA 6 NA 14 NA 2 8 15 16 NA ...
##  $ product_category_3        : int  NA 14 NA NA NA NA 17 NA NA NA ...
##  $ purchase                  : int  8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...

Preliminary summary of the dataset:

##     user_id            product_id     gender        age        
##  Min.   :1000001   P00265242:  1858   F:132197   0-17 : 14707  
##  1st Qu.:1001495   P00110742:  1591   M:405380   18-25: 97634  
##  Median :1003031   P00025442:  1586              26-35:214690  
##  Mean   :1002992   P00112142:  1539              36-45:107499  
##  3rd Qu.:1004417   P00057642:  1430              46-50: 44526  
##  Max.   :1006040   P00184942:  1424              51-55: 37618  
##                    (Other)  :528149              55+  : 20903  
##    occupation     city_category stay_in_current_city_years
##  Min.   : 0.000   A:144638      0 : 72725                 
##  1st Qu.: 2.000   B:226493      1 :189192                 
##  Median : 7.000   C:166446      2 : 99459                 
##  Mean   : 8.083                 3 : 93312                 
##  3rd Qu.:14.000                 4+: 82889                 
##  Max.   :20.000                                           
##                                                           
##  marital_status   product_category_1 product_category_2 product_category_3
##  Min.   :0.0000   Min.   : 1.000     Min.   : 2.00      Min.   : 3.0      
##  1st Qu.:0.0000   1st Qu.: 1.000     1st Qu.: 5.00      1st Qu.: 9.0      
##  Median :0.0000   Median : 5.000     Median : 9.00      Median :14.0      
##  Mean   :0.4088   Mean   : 5.296     Mean   : 9.84      Mean   :12.7      
##  3rd Qu.:1.0000   3rd Qu.: 8.000     3rd Qu.:15.00      3rd Qu.:16.0      
##  Max.   :1.0000   Max.   :18.000     Max.   :18.00      Max.   :18.0      
##                                      NA's   :166986     NA's   :373299    
##     purchase    
##  Min.   :  185  
##  1st Qu.: 5866  
##  Median : 8062  
##  Mean   : 9334  
##  3rd Qu.:12073  
##  Max.   :23961  
## 
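The inspection steps above (dimensions, structure, summary) can be sketched in base R. This is a minimal sketch on a toy data frame rebuilt from the six rows printed at the top of the report, with only some of the columns; the name `bf` is my own choice, and the real analysis would of course run on the full file.

```r
# Toy data frame rebuilt from the first six rows shown in the report.
# 'bf' is an illustrative name; only a subset of the 12 columns is included.
bf <- data.frame(
  user_id = c(1000001L, 1000001L, 1000001L, 1000001L, 1000002L, 1000003L),
  product_id = factor(c("P00069042", "P00248942", "P00087842",
                        "P00085442", "P00285442", "P00193542")),
  gender = factor(c("F", "F", "F", "F", "M", "M")),
  age = factor(c("0-17", "0-17", "0-17", "0-17", "55+", "26-35")),
  city_category = factor(c("A", "A", "A", "A", "C", "A")),
  purchase = c(8370L, 15200L, 1422L, 1057L, 7969L, 15227L)
)

dim(bf)      # number of rows and columns
str(bf)      # type of each column (int, Factor, ...)
summary(bf)  # quartiles for numeric columns, counts for factors
```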

Number of unique customers

## [1] 5891
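Counting unique customers comes down to counting distinct user_id values. A minimal sketch on the six user_id values printed at the top of the report (the vector here is toy data, not the full column):

```r
# user_id values from the first six transactions shown in the report
user_id <- c(1000001L, 1000001L, 1000001L, 1000001L, 1000002L, 1000003L)

# Each customer appears once per transaction, so count distinct ids
n_customers <- length(unique(user_id))
n_customers
```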

Let’s use some plots to represent the data.

This table shows the main dataset grouped by user_id:

## # A tibble: 6 x 8
## # Groups:   user_id, gender, age, occupation, city_category,
## #   stay_in_current_city_years [6]
##   user_id gender    age occupation city_category
##     <int> <fctr> <fctr>      <int>        <fctr>
## 1 1000001      F   0-17         10             A
## 2 1000002      M    55+         16             C
## 3 1000003      M  26-35         15             A
## 4 1000004      M  46-50          7             B
## 5 1000005      M  26-35         20             A
## 6 1000006      F  51-55          9             A
## # ... with 3 more variables: stay_in_current_city_years <fctr>,
## #   marital_status <int>, total_purchases <int>

Summary of the dataset grouped by user_id:

##     user_id        gender      age         occupation     city_category
##  Min.   :1000001   F:1666   0-17 : 218   Min.   : 0.000   A:1045       
##  1st Qu.:1001518   M:4225   18-25:1069   1st Qu.: 3.000   B:1707       
##  Median :1003026            26-35:2053   Median : 7.000   C:3139       
##  Mean   :1003025            36-45:1167   Mean   : 8.153                
##  3rd Qu.:1004532            46-50: 531   3rd Qu.:14.000                
##  Max.   :1006040            51-55: 481   Max.   :20.000                
##                             55+  : 372                                 
##  stay_in_current_city_years marital_status total_purchases   
##  0 : 772                    Min.   :0.00   Min.   :   44108  
##  1 :2086                    1st Qu.:0.00   1st Qu.:  234914  
##  2 :1145                    Median :0.00   Median :  512612  
##  3 : 979                    Mean   :0.42   Mean   :  851752  
##  4+: 909                    3rd Qu.:1.00   3rd Qu.: 1099005  
##                             Max.   :1.00   Max.   :10536783  
## 
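The grouping by user_id can be sketched as follows. The tibble output suggests the report used dplyr; this sketch uses base R `aggregate()` instead so it runs without extra packages, and the data frame `tx` is toy data taken from the first six transactions.

```r
# Toy transactions (first six rows of the report); 'tx' is an illustrative name
tx <- data.frame(
  user_id  = c(1000001L, 1000001L, 1000001L, 1000001L, 1000002L, 1000003L),
  gender   = c("F", "F", "F", "F", "M", "M"),
  purchase = c(8370L, 15200L, 1422L, 1057L, 7969L, 15227L)
)

# One row per customer: sum the purchase amounts within each user_id
by_user <- aggregate(purchase ~ user_id + gender, data = tx, FUN = sum)

# Rename the aggregated column to match the report's total_purchases
names(by_user)[names(by_user) == "purchase"] <- "total_purchases"
by_user
```

Keeping gender (or any other per-customer attribute) in the formula carries it through to the grouped table, which is what makes the per-customer counts in the summary possible.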

The purchase plot shows a roughly symmetrical distribution with a peak around $7,000. I notice some gaps at different purchase levels.

Let’s explore the discrete variables.

We have more male than female customers. City category C is the one with the most customers, and the 26-35 age bracket is the largest among the customers.

Marital status

The number of unmarried customers is higher than the number of married ones.

Most users made total purchases of less than $5 million, but we notice some outliers with totals above $7.5 million and $10 million.

Total purchases by gender

## # A tibble: 2 x 2
##   gender total_purchases
##   <fctr>           <dbl>
## 1      F      1164624021
## 2      M      3853044357
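The totals by gender mix two effects: there are more male customers, and each customer may spend more or less. A small sketch that separates the two, using only the totals above and the customer counts from the grouped summary (F: 1666, M: 4225); the variable names are my own.

```r
# Figures copied from the report's output
totals    <- c(F = 1164624021, M = 3853044357)  # total purchases by gender
customers <- c(F = 1666, M = 4225)              # unique customers by gender

share        <- totals / sum(totals)   # share of all spending
per_customer <- totals / customers     # average total per customer

round(share, 3)
round(per_customer)
```

Comparing `share` with `per_customer` shows how much of the gap between genders comes from head count rather than from spending per person.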

Some products are more popular than others

Univariate Analysis

What is the structure of your dataset?

The dataset has 537,577 observations and 12 variables. Each user (a customer of the store) is described by a user_id, age, gender, occupation, city_category, stay_in_current_city_years and marital_status. For each transaction we have the product_id, the product categories (product_category_1, product_category_2, product_category_3) and the purchase amount.

We notice that most users are from city category C, but most purchases are from city category B.

Most users made total purchases of less than $5 million, but we notice some users (outliers) with totals above $7.5 million and $10 million.

The number of male shoppers is higher than the number of female shoppers, and the number of unmarried shoppers is higher than the number of married shoppers.

What is/are the main feature(s) of interest in your dataset?

The dataset can help us predict which profile of user will spend more, and on which category of article. The combination of the variables age, gender and occupation can help us with that.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Marital status, city, time spent in the city and occupation are other features that should be explored to support the main interest of this exploration.

Did you create any new variables from existing variables in the dataset?

No new variables were added to the original dataset; the only derived quantity is the per-user total_purchases in the grouped table.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I grouped the data by user_id in order to study the users and their features, and to get the proportion of users and the total purchases per user.

Bivariate Plots Section

Relationship between the age and the total purchases

The 26-35 age bracket is the one with the highest amount of purchases. The 0-17 and 55+ brackets are the ones with the smallest amount of purchases. We can also notice some outliers, but in general the median amount of purchases is about the same across all age brackets.

The difference in total purchases between male and female customers is not large, but we notice many more extremely high totals among male than among female customers.

City C has the lowest level of purchases; its median is the lowest among the 3 cities. In city A and city B about 75% of the purchases were about the same amount, but city A has a lot more outliers with very high total purchases. The second plot shows that city B has the highest total amount of purchases and city A has the lowest.

Relationship between the occupation and the total purchases

Occupations up to category 7 have made the most purchases

Relationship between the marital status and the total purchases

Unmarried customers made many more purchases than married customers.

There are more single male than single female customers, and also more married male than married female customers. This might explain why male customers made more purchases than female customers. Singles can usually afford to spend more than married customers, but this is worth more exploration.

The 26-35 and 36-45 brackets are the busiest; they hold occupations from all categories.

Let’s explore the relationship between age and city_category.

##        
##           A   B   C
##   0-17   25  50 143
##   18-25 214 331 524
##   26-35 461 652 940
##   36-45 176 335 656
##   46-50  53 146 332
##   51-55  67 135 279
##   55+    49  58 265
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 164.34, df = 12, p-value < 2.2e-16

The p-value of Pearson’s chi-squared test is very small (< 0.05), so we reject independence: there is clearly a relationship between age and city_category.

City A has a population mainly between 18 and 45 years old.

Let’s explore the relationship between age and stay_in_current_city_years.
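The test can be reproduced from the contingency table printed above. A minimal sketch: the counts are copied from the report, `tbl` is the name shown in the test output, and the column proportions at the end are my own addition to show how the age mix differs by city.

```r
# Age (rows) by city_category (columns), counts copied from the report
tbl <- matrix(
  c( 25,  50, 143,
    214, 331, 524,
    461, 652, 940,
    176, 335, 656,
     53, 146, 332,
     67, 135, 279,
     49,  58, 265),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    age = c("0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"),
    city_category = c("A", "B", "C")
  )
)

# Pearson's chi-squared test of independence
res <- chisq.test(tbl)
res$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value

# Age distribution within each city (each column sums to 1)
round(prop.table(tbl, margin = 2), 3)
```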

##        
##           0   1   2   3  4+
##   0-17   31  81  47  33  26
##   18-25 145 357 215 175 177
##   26-35 266 729 384 346 328
##   36-45 150 397 237 216 167
##   46-50  75 192 102  83  79
##   51-55  61 204  78  60  78
##   55+    44 126  82  66  54
## 
##  Pearson's Chi-squared test
## 
## data:  tbl1
## X-squared = 29.058, df = 24, p-value = 0.2179

Contrary to my expectation, the test finds no statistically significant relationship (p-value = 0.2179 > 0.05) between age and the number of years a person has stayed in the same city.

Now, let’s explore more about the products bought.
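To see why the test does not reject independence here, it helps to look at the row proportions: if age and length of stay were related, the rows would differ noticeably. A sketch using the counts printed above (`tbl1` is the name shown in the test output):

```r
# Age (rows) by stay_in_current_city_years (columns), counts from the report
tbl1 <- matrix(
  c( 31,  81,  47,  33,  26,
    145, 357, 215, 175, 177,
    266, 729, 384, 346, 328,
    150, 397, 237, 216, 167,
     75, 192, 102,  83,  79,
     61, 204,  78,  60,  78,
     44, 126,  82,  66,  54),
  ncol = 5, byrow = TRUE,
  dimnames = list(
    age = c("0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"),
    stay = c("0", "1", "2", "3", "4+")
  )
)

# Share of each length of stay within every age bracket (rows sum to 1)
round(prop.table(tbl1, margin = 1), 3)

# Same test as in the report
res1 <- chisq.test(tbl1)
res1$parameter
res1$p.value
```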

## # A tibble: 6 x 10
## # Groups:   product_id, product_category_1, product_category_2,
## #   product_category_3, age, gender, city_category [6]
##   product_id product_category_1 product_category_2 product_category_3
##       <fctr>              <int>              <int>              <int>
## 1  P00000142                  3                  4                  5
## 2  P00000142                  3                  4                  5
## 3  P00000142                  3                  4                  5
## 4  P00000142                  3                  4                  5
## 5  P00000142                  3                  4                  5
## 6  P00000142                  3                  4                  5
## # ... with 6 more variables: age <fctr>, gender <fctr>,
## #   city_category <fctr>, marital_status <int>, total_purchases <int>,
## #   nbre_purchases <int>
##      product_id     product_category_1 product_category_2
##  P00265242:    78   Min.   : 1.000     Min.   : 2.00     
##  P00059442:    77   1st Qu.: 3.000     1st Qu.: 6.00     
##  P00085942:    77   Median : 5.000     Median :11.00     
##  P00086042:    77   Mean   : 6.003     Mean   :10.22     
##  P00251242:    77   3rd Qu.: 8.000     3rd Qu.:15.00     
##  P00028842:    76   Max.   :18.000     Max.   :18.00     
##  (Other)  :120526                      NA's   :48524     
##  product_category_3    age        gender    city_category marital_status  
##  Min.   : 3.0       0-17 : 6542   F:48274   A:36780       Min.   :0.0000  
##  1st Qu.: 9.0       18-25:20575   M:72714   B:45135       1st Qu.:0.0000  
##  Median :14.0       26-35:28236             C:39073       Median :0.0000  
##  Mean   :12.7       36-45:23956                           Mean   :0.4789  
##  3rd Qu.:16.0       46-50:16676                           3rd Qu.:1.0000  
##  Max.   :18.0       51-55:14583                           Max.   :1.0000  
##  NA's   :95221      55+  :10420                                           
##  total_purchases   nbre_purchases   
##  Min.   :    186   Min.   :  1.000  
##  1st Qu.:   7864   1st Qu.:  1.000  
##  Median :  15920   Median :  2.000  
##  Mean   :  41472   Mean   :  4.443  
##  3rd Qu.:  38769   3rd Qu.:  5.000  
##  Max.   :2166488   Max.   :124.000  
## 

Most products were bought fewer than 192 times, and very few were bought more than 1,500 times.

Most of the purchases per product are under $2.5 million

The 26-35 age bracket is the one with the biggest number of purchases.

## [1] 0.90676

There is a strong correlation (about 0.91) between the number of purchases and the amount spent.
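For a regression with a single predictor, the R-squared equals the squared Pearson correlation, which ties this correlation to the first linear model fitted later in the report. A small sketch: the data below is synthetic (made-up coefficients and noise, for illustration only); only the value 0.90676 comes from the report.

```r
# Synthetic data, for illustration only (not the report's data)
set.seed(1)
nbre_purchases  <- rpois(200, lambda = 4) + 1
total_purchases <- 9000 * nbre_purchases + rnorm(200, sd = 15000)

r   <- cor(nbre_purchases, total_purchases)   # Pearson correlation
fit <- lm(total_purchases ~ nbre_purchases)   # simple linear regression

# With one predictor, R-squared is exactly the squared correlation
all.equal(r^2, summary(fit)$r.squared)

# Squaring the correlation reported above
0.90676^2
```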

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I noticed that there are more buyers from city C, but the amount of purchases is higher in city B; this is worth exploring further. I also noticed that occupations up to category 7 have made the most purchases, so I will explore their relationship with the other variables.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I was expecting older people to stay longer in the same city, but the data doesn’t show that. We also have more young customers in every city.

What was the strongest relationship you found?

In each city the number of customers in the 26-35 age bracket is higher than in the other brackets. It is also the bracket with the highest purchases, so it is worth exploring whether any other variable is correlated with it.

Multivariate Plots Section

Customers in occupation 8 and between 46 and 50 years old are the ones who made the most purchases.

Male customers in general made more purchases than female customers, particularly those in the 26-35 age bracket.

The 26-35 bracket is the one with the most purchases across the 3 cities, particularly in city B and then A.

A few occupations (notably 0 and 4) have very high purchases.

Now, let’s see if single male customers made more purchases.

The data shows that single male customers made more purchases.

The last 2 plots show that male customers between 26 and 35 years old, across all three cities and all occupations, are the customers with the most purchases.

This plot shows that the 26-35 age bracket made the biggest number of purchases, with high amounts spent.

Linear model

## 
## Calls:
## m1: lm(formula = total_purchases ~ nbre_purchases, data = by_productid)
## m2: lm(formula = total_purchases ~ nbre_purchases + age, data = by_productid)
## m3: lm(formula = total_purchases ~ nbre_purchases + age + gender, 
##     data = by_productid)
## m4: lm(formula = total_purchases ~ nbre_purchases + age + gender + 
##     city_category, data = by_productid)
## m5: lm(formula = total_purchases ~ nbre_purchases + age + gender + 
##     city_category + marital_status, data = by_productid)
## 
## ==============================================================================================================================================
##                                 m1                      m2                      m3                      m4                      m5            
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                   -8839.627***            -5588.955***            -3287.833***            -2850.852***            -2852.977***  
##                                  (124.981)               (451.301)               (464.332)               (494.969)               (494.647)    
##   nbre_purchases                11323.322***            11506.218***            11579.953***            11594.525***            11613.442***  
##                                   (15.138)                (15.692)                (16.074)                (16.134)                (16.193)    
##   age: 18-25/0-17                                       -5187.658***            -5190.914***            -5047.054***            -6205.428***  
##                                                          (518.004)               (517.111)               (516.977)               (524.744)    
##   age: 26-35/0-17                                      -11074.908***           -11382.109***           -11357.097***           -12781.103***  
##                                                          (506.351)               (505.700)               (506.047)               (518.175)    
##   age: 36-45/0-17                                       -3855.750***            -3821.783***            -3742.798***            -5044.605***  
##                                                          (508.864)               (507.990)               (507.817)               (517.882)    
##   age: 46-50/0-17                                        -342.182                   7.043                 138.222               -1542.343**   
##                                                          (530.928)               (530.287)               (529.730)               (545.906)    
##   age: 51-55/0-17                                         724.897                 965.540                1135.432*               -500.159     
##                                                          (541.540)               (540.734)               (540.236)               (555.248)    
##   age: 55+/0-17                                          1471.907*               2175.650***             1852.037**               268.227     
##                                                          (574.050)               (574.091)               (573.889)               (587.109)    
##   gender: M/F                                                                   -4494.113***            -4601.812***            -4662.282***  
##                                                                                  (219.508)               (219.679)               (219.588)    
##   city_category: B/A                                                                                    -2641.229***            -2678.691***  
##                                                                                                          (257.176)               (257.025)    
##   city_category: C/A                                                                                     1522.495***             1540.340***  
##                                                                                                          (266.230)               (266.060)    
##   marital_status                                                                                                                 2745.671***  
##                                                                                                                                  (217.760)    
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                         0.822                   0.825                   0.825                   0.826                   0.826     
##   adj. R-squared                    0.822                   0.825                   0.825                   0.826                   0.826     
##   sigma                         36640.158               36390.739               36328.010               36285.909               36262.240     
##   F                            559527.698               81270.575               71409.945               57288.886               52163.269     
##   p                                 0.000                   0.000                   0.000                   0.000                   0.000     
##   Log-likelihood             -1443124.344            -1442294.932            -1442085.696            -1441944.403            -1441864.957     
##   Deviance            162423846693851.906     160212107672853.438     159658924624998.250     159286450786954.969     159077400095792.469     
##   AIC                         2886254.688             2884607.864             2884191.392             2883912.805             2883755.914     
##   BIC                         2886283.798             2884695.195             2884288.427             2884029.247             2883882.059     
##   N                            120988                  120988                  120988                  120988                  120988         
## ==============================================================================================================================================

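I fitted a series of nested linear models on the by_productid table, with total_purchases as the response. The full model (m5) has an R-squared of 0.826, meaning the predictors explain about 82.6% of the variance in total_purchases. Almost all of this comes from nbre_purchases alone (m1 already reaches 0.822); age, gender, city_category and marital_status mostly have statistically significant coefficients but add very little explanatory power. The model can be used to predict the amount spent from the number of purchases and the customer profile.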
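The nested-model comparison can be sketched in base R. The table above looks like it was produced with a model-table helper; this sketch uses only `lm()`, `update()` and `anova()`, and it runs on synthetic data with made-up effects, so the numbers it prints are not the report's. The data frame `d` and its coefficients are assumptions for illustration.

```r
# Synthetic data with made-up effects, for illustration only
set.seed(42)
n <- 500
d <- data.frame(
  nbre_purchases = rpois(n, 4) + 1,
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  age    = factor(sample(c("0-17", "26-35", "55+"), n, replace = TRUE))
)
d$total_purchases <- 9000 * d$nbre_purchases +
  ifelse(d$gender == "M", -4000, 0) + rnorm(n, sd = 20000)

# Nested models: each one adds a predictor to the previous one
m1 <- lm(total_purchases ~ nbre_purchases, data = d)
m2 <- update(m1, . ~ . + age)
m3 <- update(m2, . ~ . + gender)

# R-squared can only go up as predictors are added, so look at the size of the gain
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
round(r2, 4)

# F-tests for whether each added block of predictors improves the fit
anova(m1, m2, m3)
```

A predictor can be highly significant in a large sample while barely moving the R-squared, which is why it is useful to read both the coefficients and the gain between columns.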

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Single male customers have the highest amount of purchases. The 26-35 year old customers from all city categories have the highest amount of purchases, and the number of products bought was correlated with the amount spent.

Were there any interesting or surprising interactions between features?

There were no big surprises in this part of the investigation, but I was surprised that the occupation variable didn’t have that big an influence on the rest of the variables.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

The linear model I created shows that age, gender, city_category and marital status mostly have statistically significant coefficients for the amount spent, although the number of purchases carries most of the predictive power. It would have been very nice to be able to predict the product_id of the product a customer would buy using these variables.


Final Plots and Summary

Plot One

Description One

The plot shows the total purchases made by each customer. The distribution is skewed to the right: the bulk of the amounts spent is below $2.5 million, and there is a long tail of extremely high amounts, which are the outliers of this analysis.

Plot Two

Description Two

I like the fact that there is a correlation between the number of purchases and the amount spent. It shows that customers were buying a lot of products and not necessarily very expensive ones. It can be an indication that there are not a lot of outliers.

Plot Three

Description Three

These 2 plots show that the purchases were made mainly by 26-35 year old male customers, and this across the 3 cities. Some occupations were more frequent than others, but occupation doesn’t really have a big influence on the amount of purchases. We can notice that cities A and B are the ones with the most purchases. So these 2 plots sum it all up and tell us which variables influenced the purchases.

Reflection

It was very interesting to explore this dataset, and the good part is that there was no cleaning to do on it. I am a big shopper, so I enjoyed working on this particular dataset, because I always wonder how big retail stores manage and use their data to explore their customers and their shopping habits. The difficult part was finding the right functions in R for a particular little action on a plot. We learn a lot in the classes and absorb a lot of new information, but it is very difficult to remember all of it and to find the right function to use when needed. I guess a lot of practice is always needed to learn a new language. Overall, I enjoyed working on this project! I wish I had more time to explore this dataset further, and to explore the other side of it, namely which products are the most popular. It can be a good future project for when I finish my Nanodegree.